In this part of the course, we will cover the following concepts:
| Objective | Complete |
|---|---|
| Summarize the concept of topic modeling | |
| Describe the process of LDA | |
What can be done with an approach as seemingly crude as the bag-of-words?
Quite a few things, actually! They include:
Topic modeling, which we will apply to the snippet column from NYT_article_data.csv
How does LDA fall into the category of unsupervised learning?
So far, the steps we have taken are:
We have our final transformation of our processed documents, corpus_tfidf
The next step is to find out what topics seem to stand out within these documents
We can find a solution to both of these statements by running an LDA model on the corpus
| Objective | Complete |
|---|---|
| Summarize the concept of topic modeling | ✔ |
| Describe the process of LDA | |
Latent Dirichlet Allocation (LDA) is a popular algorithm for
topic modeling for many reasons; it allows us to:
The algorithm is summarized in three steps:
Here is the original paper on the algorithm, written by David M. Blei, Andrew Y. Ng, and Michael I. Jordan
Let’s use a simple corpus as an example
It consists of three documents, which are actually just
three sentences
What do you think LDA will do with these documents?
LDA could:
LDA actually treats each of these documents as a bag-of-words; you then label the topics as you see fit
Remember how we applied the TF-IDF transformation to each document?
This will help you understand the benefit of LDA defining topics at the word level
We can infer the content spread of each sentence from a word
count:
We can derive the proportions that each word constitutes in given topics
LDA might produce something like:
Topic A might comprise words in the following proportions: 40% bananas, 20% ate, 20% salad, 20% munch
Topic B might comprise words in the following proportions: 25% cat, 25% hamster, 25% dog, 25% cute
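The hypothetical topic definitions above can be written down as proportion tables. In this sketch each topic is just a dictionary mapping words to their proportions, taken directly from the numbers in the text.

```python
# The word-level topic definitions above, written as proportion tables
# (values taken from the hypothetical output in the text)
topic_a = {"bananas": 0.40, "ate": 0.20, "salad": 0.20, "munch": 0.20}
topic_b = {"cat": 0.25, "hamster": 0.25, "dog": 0.25, "cute": 0.25}

# Each topic is a probability distribution over words, so proportions sum to 1
for topic in (topic_a, topic_b):
    assert abs(sum(topic.values()) - 1.0) < 1e-9

# The most probable word hints at a label: topic A looks food-related,
# topic B looks animal-related -- but the labels are ours, not the algorithm's
print(max(topic_a, key=topic_a.get))  # bananas
```

Note that LDA only produces the distributions; deciding that topic A is "food" and topic B is "animals" is the human labeling step mentioned earlier.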
Let’s go back to the three steps of LDA
Now, instead of three sentences, let’s imagine we have two
documents with the following words:
| Document 1 | Document 2 |
|---|---|
| dog | dog |
| dog | dog |
| cat | hamster |
| bananas | munch |
| cat | salad |
The first step is that we tell the algorithm how many topics we think there are; this is usually based on:
When trying different estimates, you may pick the one that generates topics at your desired level of interpretability
In our example, we can probably guess the number of topics by eyeballing the documents, since they are tiny
We will guess that there are two topics
The second step is when the algorithm assigns every word in
each document to a temporary topic
Let’s look at how topics have been assigned in our small example; remember, we are dealing with topic A and topic B
| Topic | Document 1 | Topic | Document 2 |
|---|---|---|---|
| B | dog | ? | dog |
| B | dog | B | dog |
| B | cat | B | hamster |
| A | bananas | A | munch |
| B | cat | A | salad |
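This assignment step can be sketched in a few lines: every word in every document simply receives a temporary, random topic. The documents below are the two from the table; the seed is fixed only so the sketch is reproducible.

```python
import random

# Step 2 sketch: assign every word in each document to a temporary topic at random
random.seed(0)  # fixed seed so the run is reproducible

documents = [["dog", "dog", "cat", "bananas", "cat"],
             ["dog", "dog", "hamster", "munch", "salad"]]
topics = ["A", "B"]

# Pair each word with a randomly chosen temporary topic
assignments = [[(random.choice(topics), word) for word in doc]
               for doc in documents]
print(assignments)
```

These initial assignments are deliberately arbitrary; the iterative third step is what gradually makes them coherent.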
Step 3 is the iterative step of the algorithm, where topics are checked and updated as the algorithm loops through each word in every document
The algorithm is looking at two main criteria:
Remember the question-marked item in Document 2 from step 2?
We will now see how the algorithm iterates and updates the ? from step 2: the topic assignment for dog in Document 2
How prevalent is the word across topics?
| Topic | Document 1 | Topic | Document 2 |
|---|---|---|---|
| B | dog | ? | dog |
| B | dog | B | dog |
| B | cat | B | hamster |
| A | bananas | A | munch |
| B | cat | A | salad |
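The first criterion can be checked with a simple count over the current assignments in the table: under which topic does "dog" currently appear, corpus-wide? The question-marked entry is left out while its new topic is being decided.

```python
from collections import Counter

# Criterion 1 sketch: how prevalent is "dog" under each topic?
# Current (topic, word) assignments from the table; the ? entry is excluded
# while we decide its new topic.
assignments = [
    ("B", "dog"), ("B", "dog"), ("B", "cat"), ("A", "bananas"), ("B", "cat"),  # Document 1
    ("B", "dog"), ("B", "hamster"), ("A", "munch"), ("A", "salad"),            # Document 2, minus the ?
]

# Count how often "dog" is assigned to each topic across the whole corpus
dog_by_topic = Counter(topic for topic, word in assignments if word == "dog")
print(dog_by_topic)  # every other "dog" currently sits in topic B
```

Since every other occurrence of "dog" is assigned to topic B, this criterion pulls the ? toward topic B.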
How prevalent are the topics in the document?
| Topic | Document 1 | Topic | Document 2 |
|---|---|---|---|
| B | dog | ? | dog |
| B | dog | B | dog |
| B | cat | B | hamster |
| A | bananas | A | munch |
| B | cat | A | salad |
| Topic | Document 1 | Topic | Document 2 |
|---|---|---|---|
| B | dog | B | dog |
| B | dog | B | dog |
| B | cat | B | hamster |
| A | bananas | A | munch |
| B | cat | A | salad |
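The update shown in the table above can be sketched by combining the two criteria: score each candidate topic by (how prevalent that topic is within Document 2) times (how prevalent "dog" is under that topic corpus-wide), then reassign the ? to the higher-scoring topic. This is a sketch of a single update, not the full iterative algorithm.

```python
from collections import Counter

# Reassigning the ? for "dog" in Document 2 by combining the two criteria.
# Current (topic, word) assignments; the ? entry is held out while resampled.
doc1 = [("B", "dog"), ("B", "dog"), ("B", "cat"), ("A", "bananas"), ("B", "cat")]
doc2 = [("B", "dog"), ("B", "hamster"), ("A", "munch"), ("A", "salad")]

# Criterion 1: how prevalent is "dog" under each topic, corpus-wide?
dog_counts = Counter(t for t, w in doc1 + doc2 if w == "dog")

# Criterion 2: how prevalent is each topic within Document 2?
doc2_counts = Counter(t for t, w in doc2)

# Score each candidate topic by the product of the two criteria
scores = {t: dog_counts[t] * doc2_counts[t] for t in ("A", "B")}
new_topic = max(scores, key=scores.get)
print(new_topic)  # B, matching the updated table above
```

Topic B wins because "dog" is overwhelmingly a topic-B word across the corpus; repeating this update over every word in every document, many times, is what step 3 of the algorithm does.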
| Objective | Complete |
|---|---|
| Summarize the concept of topic modeling | ✔ |
| Describe the process of LDA | ✔ |